Mutual Information

Measuring Entropy Reduction: How much reduction in the entropy of X can we obtain by knowing Y?

$$
I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)
$$
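As a quick sanity check, the three identities above can be verified numerically on a small discrete joint distribution. The sketch below is only illustrative: the joint table `p_xy` is a made-up example, and entropies are computed in bits.

```python
import numpy as np

# Hypothetical 2x3 joint distribution p(x, y); rows index X, columns index Y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
H_X_given_Y = H_XY - H_Y        # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X

# All three expressions give the same mutual information.
print(H_X - H_X_given_Y)
print(H_Y - H_Y_given_X)
print(H_X + H_Y - H_XY)
```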




Properties

Formally:

$$
I(X;Y) = \int_Y \int_X p(x,y) \log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right) dx\, dy
$$
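For discrete variables the integrals become sums over the outcomes of X and Y. A minimal sketch of the direct double-sum computation, reusing the hypothetical joint table `p_xy` from the earlier sketch (in bits):

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = sum_x sum_y p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi = 0.0
for i in range(p_xy.shape[0]):
    for j in range(p_xy.shape[1]):
        if p_xy[i, j] > 0:
            mi += p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
print(mi)
```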

Expressed as a KL divergence:

$$
I(X;Y) = D_{\mathrm{KL}}\big(\,p(x,y)\,\|\,p(x)\,p(y)\,\big)
$$

MI measures the divergence of the actual joint distribution p(x,y) from the product distribution p(x)p(y) that would be expected if X and Y were independent.
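The same quantity can therefore be computed as the KL divergence between the joint table and the outer product of the marginals. The sketch below is an assumption-laden illustration on the same hypothetical joint table; it uses scipy.stats.entropy, which returns the KL divergence when given two distributions (here with base 2).

```python
import numpy as np
from scipy.stats import entropy

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product of marginals p(x)p(y): the joint we would see if X and Y were independent.
p_indep = np.outer(p_x, p_y)

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) ), in bits.
mi = entropy(p_xy.ravel(), p_indep.ravel(), base=2)
print(mi)
```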

Furthermore,

$$
\begin{aligned}
I(X;Y) &= D_{\mathrm{KL}}\big(\,p(x,y)\,\|\,p(x)\,p(y)\,\big) \\
&= \int\!\!\int p(x,y) \log\frac{p(x,y)}{p(x)\,p(y)}\, dx\, dy \\
&= \int\!\!\int p(x \mid y)\, p(y) \log\frac{p(x \mid y)}{p(x)}\, dx\, dy \\
&= \int p(y) \left( \int p(x \mid y) \log\frac{p(x \mid y)}{p(x)}\, dx \right) dy \\
&= \mathbb{E}_Y\!\left\{ D_{\mathrm{KL}}\big(\,p(x \mid y)\,\|\,p(x)\,\big) \right\}
\end{aligned}
$$

Thus, MI can also be understood as the expectation, over Y, of the Kullback–Leibler divergence between the conditional distribution p(x|y) of X given Y and the marginal distribution p(x) of X: the more the distributions p(x|y) and p(x) differ on average, the greater the information gain.
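The last line of the derivation can be checked numerically as well: compute D_KL(p(x|y) || p(x)) for each value of y and average under p(y). A minimal sketch on the same hypothetical joint table:

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Conditional distributions p(x|y) = p(x,y) / p(y), one column per value of y.
p_x_given_y = p_xy / p_y

# KL divergence of p(x|y) from p(x) for each y, in bits.
kl_per_y = np.array([
    np.sum(col * np.log2(col / p_x)) for col in p_x_given_y.T
])

# Averaging over p(y) recovers I(X;Y).
print(np.sum(p_y * kl_per_y))
```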

References

Mutual Information: https://en.wikipedia.org/wiki/Mutual_information
Text Mining: https://www.coursera.org/learn/text-mining